[codex] Prevent MCP tool metadata hangs on malformed responses #110
Merged
Constraint: MCP Python client can log malformed JSON-RPC errors without waking pending initialize/list_tools awaits.
Rejected: Template-side timeout only | leaves SDK callers exposed to the same hang.
Confidence: high
Scope-risk: narrow
Directive: Keep MCP metadata operations bounded so agent creation cannot wait indefinitely on malformed server responses.
Tested: uv run ruff check agentrun/tool/api/mcp.py tests/unittests/tool/test_mcp.py; uv run pytest tests/unittests/tool/test_mcp.py -q; uv run pytest tests/unittests/tool -q; git diff --check
Not-tested: live MCP server returning malformed JSON-RPC error
Closes: coop#82638110
Change-Id: I20569d10af7ba44c140ab19e446d7fc35870f7ec

Constraint: Reproduce malformed JSON-RPC response before changing SDK behavior.
Rejected: Unit-only coverage | it did not exercise the MCP transport/context-manager path.
Confidence: high
Scope-risk: narrow
Directive: Keep malformed MCP response handling bounded at the SDK MCP boundary; services must still return valid JSON-RPC errors.
Tested: uv run pytest tests/e2e/test_mcp_malformed_response.py -q; uv run pytest tests/unittests/tool/test_mcp.py -q; uv run pytest tests/unittests/tool -q; uv run ruff check agentrun/tool/api/mcp.py tests/unittests/tool/test_mcp.py tests/e2e/test_mcp_malformed_response.py; git diff --check
Change-Id: Icde49bbfd79f29eb64acdab904f1f5df8df47bcd
Not-tested: Full e2e suite against remote AgentRun services.
OhYee (Member) approved these changes on Jun 2, 2026 and left a comment:
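Review: PR #110 - Prevent MCP tool metadata hangs on malformed responses
Verdict: LGTM ✅
This fixes a real production problem: when the MCP Python transport receives a malformed JSON-RPC response (for example `error.message = null`), SDK callers hang indefinitely.
Change analysis
- The timeout design is reasonable
  - Metadata operations (`initialize` / `list_tools`): `min(Config.timeout, 30s)`, since metadata loading should be fast
  - Tool calls (`call_tool`): `Config.timeout`, or the default of 600s, since tool execution can be slow
  - The intent is clear: if the user configures a shorter timeout, metadata operations should fail sooner as well
- ExceptionGroup handling: `_find_mcp_timeout_error` recursively searches nested exceptions for a TimeoutError, which correctly handles the ExceptionGroup wrapping that asyncio may produce. The `str(exc).startswith("MCP ")` prefix check is string-based, but it only matches the TimeoutError instances this code raises itself, so the risk is contained
- Test coverage is sufficient
  - E2E: a real malformed MCP server (FastAPI mock) verifies the timeout behavior
  - Unit: a mocked `never_return` coroutine verifies the timeouts for `initialize` and `call_tool`
- Incidental cleanup: the `get_agentrun_signed_headers` import moves from the middle of the file to the top level, and the unused `ToolSchema` import is removed
Minor
The streamable and SSE branches contain duplicated timeout-wrapping logic, but this follows the existing code structure (the two transport modes are handled separately) and was not introduced by this PR.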
🤖 Reviewed by Cortex + Claude Code
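For readers unfamiliar with the nested-exception search mentioned in the review, the idea can be sketched roughly as follows. This is not the SDK's `_find_mcp_timeout_error`; the function name `find_prefixed_timeout` and the `FakeGroup` stand-in class are hypothetical, and the only details taken from the review are the recursive search and the `"MCP "` message prefix. The sketch relies on the fact that exception groups expose their children through an `exceptions` attribute.

```python
def find_prefixed_timeout(exc, prefix="MCP "):
    """Return the first TimeoutError whose message starts with `prefix`.

    Hypothetical helper: walks any exception that exposes an `exceptions`
    attribute (as ExceptionGroup does on Python 3.11+) depth-first.
    """
    if isinstance(exc, TimeoutError) and str(exc).startswith(prefix):
        return exc
    # Plain exceptions have no `exceptions` attribute, so the loop is skipped.
    for child in getattr(exc, "exceptions", ()):
        found = find_prefixed_timeout(child, prefix)
        if found is not None:
            return found
    return None


class FakeGroup(Exception):
    """Stand-in for ExceptionGroup so the sketch also runs before 3.11."""

    def __init__(self, message, exceptions):
        super().__init__(message)
        self.exceptions = tuple(exceptions)


# Usage: the SDK-style timeout is buried two levels deep.
own_timeout = TimeoutError("MCP initialize timed out after 30s")
wrapped = FakeGroup("outer", [ValueError("noise"), FakeGroup("inner", [own_timeout])])
match = find_prefixed_timeout(wrapped)
unrelated = find_prefixed_timeout(TimeoutError("socket timed out"))
```

Here `match` is the original `own_timeout` object, while `unrelated` is `None` because the message lacks the prefix. That is the property the review points to: only timeouts raised by the wrapping code itself are matched.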
Summary
Fixes an AgentRun SDK hang where ToolResource MCP metadata loading can wait indefinitely when the MCP Python transport logs a malformed JSON-RPC response, for example an error payload with `error.message = null`.
Aone: https://project.aone.alibaba-inc.com/v2/project/2139638/req/82638110
Root Cause
The MCP Python streamable HTTP transport can surface malformed JSON-RPC response parsing as an `Exception` on the read stream. The default `ClientSession` handler does not route that exception back to the pending `initialize` or `list_tools` request, so SDK callers can keep awaiting forever.
Changes
- [...] (`initialize` and `list_tools`) with a 30s timeout so agent creation cannot hang indefinitely on malformed or silent MCP responses.
- [...] `Config.timeout` so tool calls also fail instead of waiting forever.
Validation
- `uv run ruff check agentrun/tool/api/mcp.py tests/unittests/tool/test_mcp.py`
- `uv run pytest tests/unittests/tool/test_mcp.py -q`
- `uv run pytest tests/unittests/tool -q`
- `git diff --check`
Notes
The MCP service should still be fixed to return a valid JSON-RPC error with a string `error.message`; this SDK change prevents the client-side hang while preserving that server-side requirement.
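As an illustration of the bounding approach described above, here is a minimal sketch of wrapping an await that never completes in a timeout. It is not the SDK's code: `metadata_timeout`, `call_timeout`, `bounded`, and `never_answered` are hypothetical names. The 30s cap and 600s default come from the review discussion on this PR. Falling back to the cap when no timeout is configured is an assumption.

```python
import asyncio

METADATA_TIMEOUT_CAP = 30.0   # seconds, cap for initialize / list_tools
DEFAULT_CALL_TIMEOUT = 600.0  # seconds, default for call_tool


def metadata_timeout(config_timeout=None):
    """min(config timeout, 30s). Using the cap when unset is an assumption."""
    if config_timeout is None:
        return METADATA_TIMEOUT_CAP
    return min(config_timeout, METADATA_TIMEOUT_CAP)


def call_timeout(config_timeout=None):
    """Configured timeout for tool calls, else the 600s default."""
    return DEFAULT_CALL_TIMEOUT if config_timeout is None else config_timeout


async def bounded(awaitable, timeout, operation):
    """Await `awaitable`, raising a prefixed TimeoutError if it takes too long."""
    try:
        return await asyncio.wait_for(awaitable, timeout)
    except asyncio.TimeoutError:
        # Re-raise as the builtin TimeoutError with an "MCP " prefix so that
        # callers can tell this timeout apart from unrelated ones.
        raise TimeoutError(f"MCP {operation} timed out after {timeout}s") from None


async def never_answered():
    """Simulates a pending request whose response is never routed back."""
    await asyncio.Future()  # never resolved


def demo():
    try:
        asyncio.run(bounded(never_answered(), 0.05, "initialize"))
    except TimeoutError as exc:
        return str(exc)
    return None


message = demo()
```

Without the `bounded` wrapper, `asyncio.run(never_answered())` would block forever, which mirrors the hang described under Root Cause. With it, `message` holds the timeout text and the caller regains control after 0.05 seconds.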